Abstract: The motivation for this paper is to apply Bayesian
structure learning using Model Averaging to large-scale networks.
Currently, Bayesian model averaging is applicable only to networks
with tens of variables, restrained by its super-exponential
complexity. We present a novel framework, called LSBN (Large-Scale
Bayesian Network), which makes it possible to handle much larger
networks by following the principle of divide-and-conquer.

LSBN comprises three steps. First, LSBN partitions the variables
using a second-order partition strategy, which achieves more robust
results. Second, LSBN conducts sampling and structure learning within
each overlapping community after the community is isolated from
other variables by its Markov Blanket. Finally, LSBN employs an
efficient algorithm to merge the structures of the overlapping
communities into a whole.

In comparison with four other state-of-the-art large-scale network
structure learning algorithms, namely ARACNE, PC, Greedy Search and
MMHC, LSBN shows comparable results on five common benchmark
datasets, evaluated by precision, recall and F-score. Moreover, LSBN
makes it possible to learn large-scale Bayesian structures by Model
Averaging, which used to be intractable.

In summary, LSBN provides a scalable and parallel framework for the
reconstruction of network structures. In addition, the complete
information about overlapping communities is obtained as a
byproduct, which could be used to mine meaningful clusters in
biological networks, such as protein-protein interaction networks or
gene regulatory networks, as well as in social networks.



I. Introduction 

Structure learning from sparse data serves as a central problem in a
variety of research areas, for it uncovers underlying relationships
and dependencies among variables and, more importantly, brings forth
a structured, easily understood model for further prediction and
inference. As a major structure learning approach, a Bayesian network
describes a probabilistic graphical model by representing a set of
random variables and their conditional dependencies via a directed
acyclic graph (DAG). Moreover, a Bayesian network provides a very
flexible framework to fuse different types of data and prior
knowledge together to derive a synthesized network.

To achieve more robust and proper results in Bayesian structure
learning, it is preferable to integrate over all possible structure
models by using Bayesian model averaging. However, as the number of
network variables grows, the enumeration of all possible structures
becomes intractable and impractical, for there are overall
$O(n!\,2^{\binom{n}{2}})$ possible structures given n network
variables [30]. In short, structure learning using model averaging is
NP-hard even when the maximum number of parents of each network
variable is bounded by a constant [19]. However, in real
applications, ranging from causality networks to
Protein-Protein-Interaction networks, the scales are much larger than
traditional structure learning using model averaging can support.
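To make the growth of the structure space concrete, the following is a minimal sketch that counts labeled DAGs exactly with the standard recurrence attributed to Robinson. The function name `count_dags` is illustrative and not part of LSBN; the sketch only shows why exhaustive enumeration stops being feasible after a handful of variables.

```python
from functools import lru_cache
from math import comb


@lru_cache(maxsize=None)
def count_dags(n: int) -> int:
    """Number of labeled DAGs on n nodes (Robinson's recurrence).

    a(0) = 1
    a(n) = sum_{k=1..n} (-1)^(k+1) * C(n, k) * 2^(k(n-k)) * a(n-k)
    """
    if n == 0:
        return 1
    return sum(
        (-1) ** (k + 1) * comb(n, k) * 2 ** (k * (n - k)) * count_dags(n - k)
        for k in range(1, n + 1)
    )


if __name__ == "__main__":
    # Print the count for a few small network sizes.
    for n in (1, 2, 3, 4, 5, 10):
        print(n, count_dags(n))
```

Already at five variables there are 29281 candidate DAGs, which illustrates why averaging over all structures is limited to small networks.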

A natural attempt to scale Bayesian structure learning beyond its
limit on the number of variables is to partition the variables into
multiple groups, so that the method can be applied to each group
separately and efficiently. Manual partition is one option [23], yet
subjective factors would inevitably play a nontrivial role and
possibly influence the ultimate result. Another widely applied
approach involves prior knowledge [25], where domain knowledge is
exploited to distinguish closely related variables and thus guide the
partition. For example, a common application under the guidance of
prior knowledge is Gene Regulatory Network (GRN) inference from gene
expression data. In this case, cluster analysis is frequently applied
to find similar functional groups, based on the assumption that genes
with similar expression patterns tend to be co-regulated or to
interact [18].

Unfortunately, the partition strategy is confronted with three
fundamental limitations. The first problem lies in the lack of prior
knowledge in most cases. It is neither practical nor tractable to
collect prior knowledge for a special purpose beforehand. The second
problem is that even when prior knowledge does exist, it remains
challenging to quantify the knowledge as a prior distribution over
networks. For example, if the prior distribution is given too much
weight, the result can be significantly biased towards the prior
knowledge, leading to unwarranted structures that pay insufficient
attention to the data. Conversely, if the prior distribution carries
too little weight, it does not help to improve the learning results.
The third problem arises when we attempt to obtain prior knowledge
directly from data by using statistical measurements, such as
correlation coefficients [33] or mutual information [26]. These
measurements are limited to pairwise information, and different
measurements inevitably lead to different partition results.

In this paper, we propose a novel framework, LSBN (Large-Scale
Bayesian Network), to learn Bayesian structures for large networks.
The basic idea of our framework is motivated by the philosophy of
divide-and-conquer. Specifically, LSBN recursively partitions the
network variables into multiple communities of much smaller size,
learns the structure of each community separately, and then merges
the results together again. Our contributions lie in three aspects:

• We propose a robust partition algorithm, called 'ROPART', to
segment large-scale network variables into multiple overlapping
communities of much smaller size. In the traditional graph clustering
problem, whether variables are allocated to the same group depends on
their edge weights, which denote how closely the variables are
interrelated. Therefore, how to measure edge weights among variables
is a challenge. Common measurements, such as mutual information [26]
and the Pearson coefficient [33], yield different partition results
given the same data. No single measurement dominates the others, for
each one performs well on some datasets and poorly on others. ROPART
therefore introduces a second-order partition strategy to overcome
this shortcoming in a robust way. (Section IV-A)



• We propose a sampling strategy to generate smaller sub-communities
when the current community is still too large for practical Bayesian
structure learning. We also show how to isolate the dependencies of
intra-community variables from those outside the community. The
isolation makes intra-community structure learning unbiased and
credible. Moreover, we categorize and analyze the primary types of
error edges produced by structure learning, and apply a uniform
strategy to resolve them with satisfactory results. (Section IV-B)



• We propose an efficient algorithm, called 'MERGENCE', to merge the
intra-community results into a whole. MERGENCE seeks an efficient
mergence order and resolves conflicts during the process of mergence.
(Section IV-C)

We evaluate LSBN on five well-known benchmark datasets, in comparison
with four state-of-the-art structure learning algorithms. The results
reveal that LSBN achieves results comparable to those of the
state-of-the-art algorithms. Meanwhile, LSBN makes it possible to
learn Bayesian structures by Model Averaging, which used to be
intractable in large-scale networks.

II. Related Work 

The problem of static Bayesian Structure Learning is well studied.
Apart from Bayesian Model Averaging, the four major approaches are
information-theoretic, constraint-based, score-and-search and hybrid.
Several representative algorithms are included in the performance
evaluation (Section V).

The first major approach to Bayesian Structure Learning is based on
Information Theory Models. They weigh network edges by correlation
coefficients or statistical scores derived from Mutual Information
[18], as in RELNET [12], ARACNE [26] and CLR [22]. Though most
information-theoretic approaches yield undirected networks, an
asymmetric variation of the Mutual Information measurement can be
employed to obtain directed networks [29]. The advantages of
Information Theory Models lie in their extreme simplicity and low
computational cost. However, such models can only take pairwise
relationships into consideration, rather than multiple variables at
the same time. Another drawback is that such models usually require
plenty of observation data for the sake of accuracy.

The second major approach is the constraint-based approach.
Specifically, constraint-based algorithms, such as PC [31] and RAI
[38], use conditional independence (CI) tests to reveal the target
DAG. The drawback of such algorithms is that the number of requisite
CI tests grows exponentially with the number of variables, so
polynomial complexity can only be ensured by bounding the maximum
number of parents. Besides, such algorithms inevitably miss or
wrongly identify some V-structures, which affects the orientation of
edges and even subsequent stages.

The third major approach follows a score-and-search strategy. In
general, score-and-search algorithms search through the structure
space guided by a scoring function. One of the most basic
score-and-search algorithms is Greedy Search [10]. Since the size of
the structure space grows super-exponentially, the search is prone to
getting trapped in a local maximum, even though there are many ways
to escape, such as random restarts, simulated annealing or searching
in the space of equivalence classes of DAGs, called PDAGs.

The fourth major approach is the hybrid approach. Hybrid approaches
integrate constraint-based and score-and-search algorithms. MMHC [36]
(Max-Min Hill-Climbing) shows superiority to other algorithms by
combining local learning, reconstructing the skeleton of a Bayesian
network with a constraint-based approach, and performing greedy
hill-climbing search for edge orientation.

A great amount of work has been devoted to detecting overlapping
communities in large networks. The most popular algorithm is the
Clique Percolation Method (CPM) [28], which computes all k-cliques
and places two variables in the same cluster if they are connected
through a chain of adjacent k-cliques, where two k-cliques are
adjacent if they share k-1 nodes. CPM is implemented by CFinder [4]
(http://www.cfinder.org/). Beyond CPM, overlapping community
detection methods can be roughly divided into three categories:
optimization, clustering and partitioning.

One traditional line of methods regards overlapping community
detection as an optimization problem. Specifically, each community is
identified as a subgraph that is locally optimal with respect to a
quality function W, so that detecting overlapping communities becomes
finding all locally optimal subgraphs [7]. Furthermore, the
optimization can be augmented by combining it with spectral mapping
and fuzzy clustering [39].

Clustering approaches define clusters as either sets of nodes or sets
of edges, and then perform the clustering according to the similarity
among nodes [27] or edges, respectively. Besides, clustering can be
conducted in an agglomerative hierarchy. For example, LinkComm [5]
merges groups of edges pairwise in descending order of edge
similarity and consequently produces a dendrogram.

Partitioning approaches transform the original graph into a larger
graph without overlapping nodes before conducting a traditional
partition. The overlapping nodes are identified and split into
multiple copies of themselves beforehand [17]. The identification of
candidate overlapping nodes is based on split betweenness [16], and
the splitting process continues as long as the split betweenness of
some variable is sufficiently high.

III. Definitions 

A. DEFINITION 1 (Weight Function) 

Given a generic undirected, unweighted graph $\mathcal{G} = (V, E)$,
where $E \subseteq V \times V$, a weight function $f$ maps any edge
$e = (u, v) \in E$ to a numeric value:

$$f : E \to \mathbb{R}, \quad (u, v) \mapsto f(u, v) \tag{1}$$

Generally speaking, a weight function maps an unweighted graph into a
weighted one. Given a weight function set consisting of n weight
functions, $F = \{f_1, \ldots, f_n\}$, a generic unweighted graph
$\mathcal{G}$ can be mapped to a weighted graph set
$D = \{G_1, \ldots, G_n\}$, where $G_i = (V, E_i)$.

B. DEFINITION 2 (Partition) 

Given an undirected, weighted graph $\mathcal{G} = (V, E)$,
$P = \{p_1, \ldots, p_C\}$ is a partition of the edges into C
communities. Each node $v \in V$ belongs to at least one community;
even an isolated node constitutes a community with a single member.
The resulting communities can be partitioned again recursively and
thus grouped into a hierarchical structure.

C. DEFINITION 3 (Partition Support Matrix) 

Given a weighted graph set $D = \{G_1, \ldots, G_n\}$, where
$G_i = (V, E_i)$, each weighted graph $G_i$ corresponds to a
partition $P_i$. The partition support matrix of a node v in
partition $P_i$, written $PSM_i(v)$, has k rows and |V| columns,
where |V| is the number of nodes in the graphs and k is the number of
communities in partition $P_i$ that contain node v. The element in
the r-th row and c-th column of $PSM_i(v)$ indicates whether the c-th
node belongs to the r-th community that contains node v. The
partition support matrix of a node v over all partitions, written
PSM(v), is defined as:

$$PSM(v) = \begin{pmatrix} PSM_1(v) \\ PSM_2(v) \\ \vdots \\ PSM_n(v) \end{pmatrix} \tag{2}$$

For example, the partition support matrix of $node_7$ for partition
$P_1$ in Figure 1 is
$PSM_1(node_7) = [[0,0,0,1,0,0,1,0,0]^T, [0,0,0,0,0,0,1,1,1]^T]^T$,
while for partition $P_2$ in Figure 1 it is
$PSM_2(node_7) = [0,0,0,1,0,0,1,1,1]$. Then the overall partition
support matrix of $node_7$ is

$$PSM(node_7) = \begin{pmatrix} 0 & 0 & 0 & 1 & 0 & 0 & 1 & 0 & 0 \\ 0 & 0 & 0 & 0 & 0 & 0 & 1 & 1 & 1 \\ 0 & 0 & 0 & 1 & 0 & 0 & 1 & 1 & 1 \end{pmatrix}$$

As one can see, $node_7$ tends to be grouped together with $node_4$,
$node_8$ and $node_9$.
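The construction of a partition support matrix can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: the function names are hypothetical, nodes are numbered 1..|V|, and the two partitions `P1` and `P2` are assumed stand-ins for Figure 1, chosen only so that the communities containing node 7 match the example above (the other communities are made up).

```python
from typing import List, Set


def psm_for_partition(partition: List[Set[int]], v: int, n_nodes: int) -> List[List[int]]:
    """One row per community of `partition` that contains v.

    Column c (1-based) is 1 if node c belongs to that community, else 0.
    """
    return [
        [1 if node in community else 0 for node in range(1, n_nodes + 1)]
        for community in partition
        if v in community
    ]


def psm(partitions: List[List[Set[int]]], v: int, n_nodes: int) -> List[List[int]]:
    """Stack the per-partition matrices PSM_1(v), ..., PSM_n(v) row-wise."""
    rows: List[List[int]] = []
    for partition in partitions:
        rows.extend(psm_for_partition(partition, v, n_nodes))
    return rows


# Hypothetical partitions of nodes 1..9 (assumed, not taken from Figure 1).
P1 = [{1, 2, 3}, {4, 7}, {7, 8, 9}, {5, 6}]
P2 = [{1, 2, 3, 5, 6}, {4, 7, 8, 9}]

PSM_NODE7 = psm([P1, P2], v=7, n_nodes=9)

if __name__ == "__main__":
    for row in PSM_NODE7:
        print(row)
```

Under these assumed partitions, the three printed rows reproduce the matrix given for $node_7$.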

D. DEFINITION 4 (Second-Order Network) 

Given a weighted graph set $D = \{G_1, \ldots, G_n\}$, where
$G_i = (V, E_i)$, and its corresponding partition set
$P = \{P_1, \ldots, P_n\}$, the second-order network is an
undirected, weighted graph $S = (V, E_S)$, where an edge
$e = (u, v) \in E_S$ if its edge weight exceeds a threshold. The edge
weight is given by the co-occurrence probability of u and v in the
partition support matrix.
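The definition leaves the exact form of the co-occurrence probability open, so the following is only a sketch under stated assumptions: the co-occurrence of v with u is taken to be the fraction of rows of PSM(u) in which v's column is 1, the two directions are averaged to obtain a symmetric weight, and the threshold value is arbitrary. The function names and the small `psms` example are hypothetical.

```python
from itertools import combinations
from typing import Dict, List, Tuple


def cooccurrence(psm_u: List[List[int]], v: int) -> float:
    """Fraction of communities containing u that also contain v (v is 1-based)."""
    if not psm_u:
        return 0.0
    return sum(row[v - 1] for row in psm_u) / len(psm_u)


def second_order_edges(
    psms: Dict[int, List[List[int]]], threshold: float
) -> Dict[Tuple[int, int], float]:
    """Keep edge (u, v) when the symmetrized co-occurrence exceeds `threshold`.

    Averaging the two directions is an assumption made for this sketch.
    """
    edges: Dict[Tuple[int, int], float] = {}
    for u, v in combinations(sorted(psms), 2):
        w = 0.5 * (cooccurrence(psms[u], v) + cooccurrence(psms[v], u))
        if w > threshold:
            edges[(u, v)] = w
    return edges


# Hypothetical PSMs for three of nine nodes.
psms = {
    4: [[0, 0, 0, 1, 0, 0, 1, 0, 0], [0, 0, 0, 1, 0, 0, 1, 1, 1]],
    7: [
        [0, 0, 0, 1, 0, 0, 1, 0, 0],
        [0, 0, 0, 0, 0, 0, 1, 1, 1],
        [0, 0, 0, 1, 0, 0, 1, 1, 1],
    ],
    8: [[0, 0, 0, 0, 0, 0, 1, 1, 1], [0, 0, 0, 1, 0, 0, 1, 1, 1]],
}
EDGES = second_order_edges(psms, threshold=0.6)
```

With these inputs, nodes 4 and 8 each co-occur strongly with node 7 but only half of the time with each other, so only the edges (4, 7) and (7, 8) survive the assumed threshold of 0.6.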

E. DEFINITION 5 (Second-Order Partition) 

A second-order partition is the partition $P_{sec}$ of a second-order
network.

IV. LSBN System: Large-Scale Bayesian Network Learning

The LSBN system provides a novel framework for Bayesian structure
learning using model averaging in large-scale networks. It adopts a
divide-and-conquer strategy to segment the originally intractable
Bayesian structure learning task into multiple tractable sub-tasks of
much smaller scale. The workflow of the LSBN system is as follows:

(1) Variable Partition. (Section IV-A)

(2) Sampling and Learning. (Section IV-B)

(3) Mergence. (Section IV-C)

Figure 1. Two different partitions of the same set of variables

A. Variable Partition

The LSBN system is expected to perform well even as the number of
variables increases. Variable partition serves as a crucial step by
drastically reducing the complexity and run-time of learning in the
next stage. The partition allows overlapping, that is, each node may
belong to more than one community. Ideally, a perfect partition
should possess three properties: selectiveness, so that nodes within
a community have a much higher probability of sharing correct edges
with each other than with nodes outside the community; high coverage,
so that every correct edge is contained in at least one community, in
other words, the partition does not break any correct edge; and fine
granularity, so that each community is small enough for Bayesian
structure learning algorithms to be applicable.

The LSBN system requires discrete-state data as its initial input. If
the data is continuous, discretization is necessary before any
further steps. We develop a novel partition algorithm, called
'ROPART', to construct a robust partition of all variables. ROPART
consists of five steps, as outlined in Algorithm 1 and illustrated in
Figure 2.

Algorithm 1 ROPART
1: build an undirected, unweighted complete graph $\mathcal{G}$ over
   all variables;
2: for each weight function $f_i$ from a predefined weight function
   set $F = \{f_1, \ldots, f_n\}$, map $\mathcal{G}$ to a weighted
   graph $\mathcal{G}_i$;
3: for each weighted graph $\mathcal{G}_i$, keep the edges whose
   weight exceeds the truncate threshold $T_{trunc}$, then generate
   the partition $P_i$;
4: for each variable v, construct its partition support matrix
   PSM(v);
5: construct the second-order network and generate the second-order
   partition;



In step 1, we start with a fully connected, undirected, unweighted
graph including all variables.

In step 2, we employ a predefined weight function set
$F = \{f_1, \ldots, f_n\}$ to translate the original unweighted graph
into a set of weighted graphs
$D = \{\mathcal{G}_1, \ldots, \mathcal{G}_n\}$, where $f_i$
corresponds to $\mathcal{G}_i$. The predefined weight functions are
as follows:



• Mutual Information [37]

$$MI(X, Y) = \sum_{x \in X} \sum_{y \in Y} P(x, y) \log \frac{P(x, y)}{P(x) P(y)}$$

• Mutual Information normalized by the sum of entropies

$$MI_{plus}(X, Y) = \frac{2\, MI(X, Y)}{H(X) + H(Y)}$$

• Mutual Information normalized by the square root product of
entropies

$$MI_{sqrt}(X, Y) = \frac{MI(X, Y)}{\sqrt{H(X) H(Y)}}$$

• Mutual Information normalized by PageRank weight

$MI_{pr}(X, Y)$ is $MI(X, Y)$ normalized by a weight derived from
$PR(X)$ and $PR(Y)$, where $PR(X)$ and $PR(Y)$ are the PageRank
values of nodes X and Y respectively.

• Mutual Information by Standard Normalization

$$MI_{sn}(X, Y) = \frac{MI(X, Y) - \mu_{MI}}{\sigma_{MI}}$$

where $\mu_{MI}$ and $\sigma_{MI}$ denote the mean value and standard
deviation of all edge weights, which are valued by mutual
information.

• Pearson Correlation Coefficient

$$\rho_{X,Y} = \frac{E[(X - \mu_X)(Y - \mu_Y)]}{\sigma_X \sigma_Y}$$

where $\mu_X$ and $\mu_Y$ denote the mean values of variables X and Y
respectively, and $\sigma_X$ and $\sigma_Y$ indicate the standard
deviations of X and Y correspondingly.

Figure 2. ROPART: a robust partition strategy combining multiple different weight functions
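As an illustration of the entropy-normalized weight functions, the following minimal sketch estimates them from paired discrete samples using empirical frequencies. The function names are illustrative, natural logarithms are assumed (the base is not stated in the text), and returning 0 when an entropy is zero is a convention chosen for this sketch.

```python
from collections import Counter
from math import log, sqrt
from typing import Hashable, Sequence


def entropy(xs: Sequence[Hashable]) -> float:
    """Empirical Shannon entropy H(X) in nats."""
    n = len(xs)
    return -sum((c / n) * log(c / n) for c in Counter(xs).values())


def mutual_information(xs: Sequence[Hashable], ys: Sequence[Hashable]) -> float:
    """Empirical MI(X, Y) = sum_{x,y} P(x,y) log(P(x,y) / (P(x) P(y)))."""
    n = len(xs)
    pxy = Counter(zip(xs, ys))
    px, py = Counter(xs), Counter(ys)
    return sum(
        (c / n) * log((c / n) / ((px[x] / n) * (py[y] / n)))
        for (x, y), c in pxy.items()
    )


def mi_plus(xs: Sequence[Hashable], ys: Sequence[Hashable]) -> float:
    """MI normalized by the sum of entropies: 2 MI / (H(X) + H(Y))."""
    denom = entropy(xs) + entropy(ys)
    return 2 * mutual_information(xs, ys) / denom if denom > 0 else 0.0


def mi_sqrt(xs: Sequence[Hashable], ys: Sequence[Hashable]) -> float:
    """MI normalized by the square root of the product of entropies."""
    denom = sqrt(entropy(xs) * entropy(ys))
    return mutual_information(xs, ys) / denom if denom > 0 else 0.0


if __name__ == "__main__":
    x = [0, 0, 1, 1]
    print(mutual_information(x, x), mi_plus(x, x), mi_sqrt(x, x))
    print(mutual_information(x, [0, 1, 0, 1]))
```

Both normalized variants are scale-free for a variable paired with itself, which is one reason different weight functions can rank the same edges differently.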

Each weighted graph $\mathcal{G}_i$ derived by weight function $f_i$
is pruned by removing the edges whose weight is lower than the
truncate threshold $T_{trunc}$.

In step 3, we partition each weighted graph $\mathcal{G}_i$ after
pruning. For the sake of convenience in Mergence (Section IV-C), we
prefer communities to be organized hierarchically. Meanwhile, for the
sake of high coverage, we wish communities to maintain pervasive
overlaps. Here we use LinkComm [5] as the partition algorithm, which
is introduced in Section II.

In step 4, we construct the partition support matrix PSM(v) for each
variable v, based on the partition result set
$P = \{P_1, \ldots, P_n\}$ generated in step 3, where $P_i$
corresponds to weighted graph $\mathcal{G}_i$.

In step 5, we build the second-order network based on the partition
support matrix set $PSM = \{PSM(v_1), \ldots, PSM(v_{|V|})\}$ and
generate the second-order partition $P_{sec}$. The resulting
second-order partition is designed to satisfy the criteria for a good
partition: (1) the communities are of high cohesion and low coupling;
and (2) the partition is robust to the selection of weight function.



B. Sampling and Learning 

The LSBN system performs Bayesian structure learning per community.
However, learning the structure correctly poses many challenges. For
example, what if an outlier variable is erroneously mixed into a
community? How can such variables be pinpointed and, most
importantly, how can the learning bias caused by irrelevant variables
in the same community be eliminated? Moreover, what if correct edges
are disconnected during partition? Is there any mechanism to retrieve
those missing edges? We first attempt to find the candidate Markov
Blanket of the nodes within a community; after that, we use a
sampling methodology to address the challenges mentioned above. The
detailed learning algorithm consists of four steps, as outlined in
Algorithm 2 and illustrated in Figure 3.



Algorithm 2 LocalLearn
1: for each community $C \in P_{sec}$ do
2:   identify the candidate Markov Blanket MB(C)
3:   sample k sub-communities based on C and MB(C),
     $SC = \{sub_1(C), \ldots, sub_k(C)\}$
4:   for each sub-community $sub_j(C) \in SC$ do
5:     learn the Bayesian structure
6:   end for
7:   ensemble the structures of all sub-communities and resolve
     conflicts
8: end for



Figure 3. LocalLearn: a sampling strategy for Bayesian structure
learning per community. Panels: (a) Step 1, find potential Markov
Blanket; (b) Step 2, sampling; (c) Step 3, learning per
sub-community; (d) Step 4, ensemble learning result.

In Step 1, we try to find the Markov Blanket MB(C) for community C.
Conditioned on the Markov Blanket, no node outside the community C
can influence the nodes within the community. To be more precise,
$\forall X \in C,\; X \perp Y \mid MB(X)$ for all
$Y \notin \{X\} \cup MB(X)$. Therefore, knowledge of the Markov
Blanket MB(C) is enough to infer the intra-community structure, which
makes the learning problem localized and easier to solve.

By definition, the Markov Blanket of a variable X is composed of all
parents of X, all children of X and all parents of X's children. In
other words, the Markov Blanket of X should be topologically closer
to X than any other variables in the network. Besides, topological
closeness is related to the significance of edge weights, so edge
weights play an important role in identifying the Markov Blanket. In
addition, conditional independence (CI) tests and measurements of
association among variables are required for further justification.
The identification of the Markov Blanket is divided into two steps:

(1) For variable X, we obtain its Markov Blanket candidates by
looking for adjacent nodes whose edge weight exceeds the average. To
be precise, $MB_{cand}(X) = \{Y \mid w_{XY} > \bar{w}_X\}$, where
$w_{XY}$ is the weight of edge $e_{XY}$ and $\bar{w}_X$ is the
average weight of the edges incident to X.

(2) Given $MB_{cand}(X)$, we employ a scalable algorithm, IAMB [35]
(Incremental Association Markov Blanket), for further justification.
IAMB consists of a forward phase and a backward phase. In the forward
phase, IAMB applies a heuristic approach to find an estimated Markov
Blanket set CMB. CMB starts as an empty set, and IAMB iteratively
adds the variable Y from $MB_{cand}(X)$ that maximizes a heuristic
function $f(Y; X \mid CMB)$. The heuristic function
$f(Y; X \mid CMB)$ denotes the mutual information between candidate
Markov Blanket node Y and target node X given CMB, which is
informative and efficient. In the backward phase, conditional
independence tests are applied within CMB, and each variable that is
independent of X given the remaining CMB is removed, one by one.
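The candidate selection in step (1) can be sketched as follows. This covers only the above-average filter, not IAMB itself; the function name and the edge-weight representation (a dictionary keyed by unordered node pairs, with no self-loops) are assumptions made for illustration.

```python
from typing import Dict, FrozenSet, Set


def mb_candidates(weights: Dict[FrozenSet[str], float], x: str) -> Set[str]:
    """Return neighbors Y of x whose edge weight exceeds x's average edge weight.

    `weights` maps frozenset({u, v}) to the weight of the undirected edge
    u - v. Self-loops are assumed not to occur.
    """
    # Collect the weight of every edge incident to x, keyed by the other endpoint.
    incident = {next(iter(edge - {x})): w for edge, w in weights.items() if x in edge}
    if not incident:
        return set()
    average = sum(incident.values()) / len(incident)
    return {y for y, w in incident.items() if w > average}


# Hypothetical edge weights around a target variable "X".
WEIGHTS = {
    frozenset({"X", "A"}): 0.9,
    frozenset({"X", "B"}): 0.1,
    frozenset({"X", "C"}): 0.4,
    frozenset({"A", "B"}): 0.7,
}

if __name__ == "__main__":
    print(mb_candidates(WEIGHTS, "X"))
```

In this example the average weight of the edges incident to X is about 0.47, so only A is kept as a candidate and would then be passed on to the forward and backward phases.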

In Step 2, we combine each community C with its Markov Blanket MB(C)
into an expanded community $C'$. The size of $C'$ would be much
smaller than expected, for intra-community variables are highly
correlated due to the partition. Chances are high that members of the
Markov Blanket are contained in community C as well.

Given the expanded community $C'$, we conduct sampling for two
reasons: (1) the size of $C'$ may still be too large for practical
Bayesian structure learning; (2) bootstrapping contributes to more
consistent and robust results. Our sampling algorithm borrows the
idea of Random Node Neighbor (RNN) [24] with some modifications.

In our sampling algorithm, we build an inner Markov graph
$G_{im}(C')$ based on $C'$; to be precise, $G_{im}(C') = (C', E)$
where $e_{XY} = 1$ if $X \in MB(Y)$ or $Y \in MB(X)$, and 0
otherwise. Then we pick an unvisited node from $G_{im}(C')$ uniformly
at random as the starting node, together with its neighbors, and
denote this set by S. The final sub-community for Bayesian structure
learning is $S \cup MB(S)$. If this sub-community is still too large
to learn, we keep removing neighbors from S until the size is
acceptable.
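The sampling step can be sketched as below. This is an illustrative reading of the description, with several assumptions stated explicitly: Markov Blankets are given as a dictionary over the expanded community, MB(S) is restricted to members of that community, the neighbor to drop is chosen at random (the text does not say which neighbor is removed first), and all names are hypothetical.

```python
import random
from typing import Dict, Set


def inner_markov_graph(mb: Dict[str, Set[str]]) -> Dict[str, Set[str]]:
    """Undirected adjacency: X ~ Y iff X in MB(Y) or Y in MB(X), within the keys of mb."""
    adj: Dict[str, Set[str]] = {x: set() for x in mb}
    for x, blanket in mb.items():
        for y in blanket:
            if y in adj and y != x:
                adj[x].add(y)
                adj[y].add(x)
    return adj


def sample_sub_community(
    mb: Dict[str, Set[str]], visited: Set[str], max_size: int, rng: random.Random
) -> Set[str]:
    """Pick an unvisited start node with its neighbors (S) and return S plus MB(S).

    Neighbors are dropped from S at random until the result fits in
    max_size or no neighbors remain (the result may then still exceed it).
    """
    adj = inner_markov_graph(mb)
    members = set(adj)
    unvisited = sorted(members - visited)
    if not unvisited:
        return set()
    start = rng.choice(unvisited)
    visited.add(start)
    neighbors = sorted(adj[start])
    while True:
        s = {start, *neighbors}
        sub = set(s)
        for node in s:
            sub |= mb[node] & members
        if len(sub) <= max_size or not neighbors:
            return sub
        neighbors.pop(rng.randrange(len(neighbors)))


# Hypothetical Markov Blankets for a chain A - B - C - D - E.
MB = {"A": {"B"}, "B": {"A", "C"}, "C": {"B", "D"}, "D": {"C", "E"}, "E": {"D"}}
VISITED: Set[str] = set()
SUB = sample_sub_community(MB, VISITED, max_size=3, rng=random.Random(0))
```

Calling `sample_sub_community` repeatedly with the same `visited` set yields further sub-communities, each seeded at a node that has not been used as a start before.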

In Step 3, for each sub-community, LocalLearn uses Bayesian model
averaging to learn the structure of the nodes within the
sub-community. Bayesian model averaging differs from other structure
learning approaches, such as constraint-based and score-and-search
approaches, in that it is interested in the confidence of
structure-related features rather than in the structure topology per
se. In this case, the feature $f(G) = 1$ if there exists an edge from
node i to node j in G, and $f(G) = 0$ otherwise. The posterior
expectation of feature f is usually estimated by:

$$P(f \mid D) = \sum_{G} f(G)\, P(G \mid D) \tag{3}$$

The feature f is classified as 1 if $P(f \mid D)$ exceeds a certain
threshold $T_{avg}$, and as 0 otherwise. However, traversing all
candidate DAGs according to Eq. (3) is usually intractable, for there
are overall $O(n!\,2^{\binom{n}{2}})$ DAGs given n nodes. One
solution is to average over the structures that conform to some
predetermined node order $\prec$ [13]; that is, if
$X_i \in Pa_G(X_j)$ then $i \prec j$. The posterior expectation of
feature f can then be written as a sum over orders [15]:

$$P(f \mid D) = \sum_{\prec} P(\prec \mid D)\, P(f \mid D, \prec) \tag{4}$$

Usually the order $\prec$ is unknown due to the lack of prior
knowledge in the current domain, so all n! possible orders over n
nodes have to be taken into consideration, which remains intractable.
We therefore use MCMC (Markov Chain Monte Carlo) techniques to sample
orders, and to sample DAGs consistent with each order [15], [14].
Assuming a uniform prior over orders, a Markov chain $\mathcal{M}$ is
constructed with a state space consisting of all n! possible orders.
This Markov chain is simulated so that its stationary distribution is
$P(\prec \mid D)$, and a sequence of sampled orders
$\prec_1, \ldots, \prec_T$ is obtained. The expectation of feature f
can then be approximated as [13]:

$$P(f \mid D) \approx \frac{1}{T} \sum_{t=1}^{T} P(f \mid D, \prec_t) \tag{5}$$
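The averaging in Eq. (5) and the subsequent thresholding can be sketched as follows. The sketch assumes that the per-order edge posteriors $P(f \mid D, \prec_t)$ have already been computed by some order-MCMC routine, which is not shown; the function names, the dictionary representation and the example numbers are all hypothetical.

```python
from typing import Dict, List, Set, Tuple

Edge = Tuple[str, str]  # directed edge (parent, child)


def average_edge_posteriors(per_order: List[Dict[Edge, float]]) -> Dict[Edge, float]:
    """Eq. (5): average P(f | D, order_t) over the T sampled orders.

    An edge missing from one sample is treated as having probability 0
    under that order.
    """
    if not per_order:
        return {}
    total = len(per_order)
    edges: Set[Edge] = set().union(*per_order)
    return {e: sum(p.get(e, 0.0) for p in per_order) / total for e in edges}


def select_edges(avg: Dict[Edge, float], t_avg: float) -> Set[Edge]:
    """Keep the edge features whose averaged posterior exceeds the threshold."""
    return {e for e, p in avg.items() if p > t_avg}


# Hypothetical per-order posteriors for T = 3 sampled orders.
SAMPLES = [
    {("A", "B"): 0.9, ("B", "C"): 0.2},
    {("A", "B"): 0.7},
    {("A", "B"): 0.8, ("B", "C"): 0.4},
]
AVERAGED = average_edge_posteriors(SAMPLES)
SELECTED = select_edges(AVERAGED, t_avg=0.5)
```

Here the edge A to B averages to about 0.8 and is kept, while B to C averages to about 0.2 and is discarded at the assumed threshold of 0.5.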

After learning the sub-community structures, we integrate them into a
uniform intra-community structure and resolve conflicts. We
investigate the characteristics of error edges and find two major
types of error:

• Additional edges due to indirect interactions. This type of error
is discussed in ARACNE [26], which tries to eliminate indirect edges
by applying the Data Processing Inequality to all triplets of
variables. However, ARACNE fails to eliminate all indirect edges
while keeping all valid edges, for two reasons: (1) it is hard to
find a fixed threshold that distinguishes indirect interactions from
direct ones; (2) in many cases, the weakest edge in a triplet does
not necessarily indicate an indirect interaction.



• Missing edges due to weak edge weight. This type of error
originates from the adjacent nodes. Some nodes may have a relatively
small PageRank value, in other words, they are periphery nodes; other
nodes may be hubs, so that their mutual information with each
neighbor, that is, the edge weight, is relatively small.

To tackle these two major types of error, we first collect all
candidate triplets of nodes. A candidate triplet contains three
mutually connected nodes, where the weight of each connecting edge
exceeds a certain threshold. Chances are high that these candidate
triplets contain indirect interactions [26]. We build an undirected,
unweighted graph from the edges of the candidate triplets. Then we
cluster this graph into sparsely connected dense subgraphs. Ideally,
on one hand, indirect interactions are clustered with direct
interactions. Since direct interactions have more significant edge
weights, the indirect interactions would be eliminated through
re-learning. On the other hand, weak correct edges are expected to be
grouped with even weaker wrong edges. As a result, they survive
re-learning due to their relative significance. Here we again employ
LinkComm [5] for graph clustering. After clustering the candidate
triplet graph, we re-learn the structure of each cluster using the
same method as in Step 3.

C. Mergence 



Algorithm 3 Mergence
1: Create a leaf node for every community and add it to the
   Community Pool;
2: while there is more than one node in the Community Pool do
3:   Remove the nodes $C_i$ and $C_j$ with maximum Jaccard similarity
     coefficient $J(C_i, C_j)$ from the Community Pool;
4:   Merge the structures of the two communities $C_i$ and $C_j$;
5:   Create a new node $C_{new} = C_i \cup C_j$ denoting the mergence
     of the two communities $C_i$ and $C_j$;
6:   Add the new node to the Community Pool;
7: end while



The LSBN system combines the structures after learning them
individually from each community. The combination involves two
concerns: (1) finding an efficient mergence order; (2) resolving the
conflicts that arise during the mergence. Intuitively, the mergence
strategy should proceed bottom-up. The LSBN system keeps piecing
together intra-community structures into larger structures, block by
block, until a whole structure is obtained. We expect the structure
to become increasingly accurate, with wrong edges being continuously
eliminated during the mergence.

Borrowing the idea of Huffman's Algorithm [20], the LSBN system
merges communities greedily by repeatedly picking the two communities
with the maximum Jaccard similarity coefficient. The Jaccard
similarity coefficient of two communities is proportional to the size
of their overlap and inversely proportional to the size of their
union. We propose the Mergence algorithm to carry out this greedy
strategy and combine the structures of all communities into a whole;
the detailed algorithm is outlined in Algorithm 3.

At first, the Mergence algorithm puts all communities into a
Community Pool, then repeatedly chooses the two communities $C_i$ and
$C_j$ with the largest Jaccard similarity coefficient
$J(C_i, C_j) = \frac{|C_i \cap C_j|}{|C_i \cup C_j|}$. After two
communities are selected, the same approach described in Step 3 of
Section IV-B is applied to resolve the conflicts by clustering
triplets. Then the Mergence algorithm combines these two communities
into a new hybrid community, and puts it into the Community Pool for
further mergence steps.
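The greedy mergence order can be sketched as follows. This minimal version tracks only the node sets of the communities: merging the learned structures and resolving conflicts is left as a placeholder comment, the Jaccard values are recomputed in every iteration rather than cached, ties are broken by pool position, and the function names are hypothetical.

```python
from itertools import combinations
from typing import FrozenSet, Iterable, List, Set, Tuple


def jaccard(a: FrozenSet[int], b: FrozenSet[int]) -> float:
    """Jaccard similarity |a & b| / |a | b| (0.0 if both sets are empty)."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0


def mergence_order(
    communities: Iterable[Set[int]],
) -> Tuple[List[Tuple[FrozenSet[int], FrozenSet[int]]], FrozenSet[int]]:
    """Greedily merge the two most similar communities until one remains.

    Returns the sequence of merged pairs and the final node set.
    """
    pool = [frozenset(c) for c in communities]
    if not pool:
        raise ValueError("at least one community is required")
    order: List[Tuple[FrozenSet[int], FrozenSet[int]]] = []
    while len(pool) > 1:
        i, j = max(
            combinations(range(len(pool)), 2),
            key=lambda ij: jaccard(pool[ij[0]], pool[ij[1]]),
        )
        ci, cj = pool[i], pool[j]
        order.append((ci, cj))
        # Placeholder: merge the learned structures of ci and cj and
        # resolve conflicting edges here.
        pool = [c for k, c in enumerate(pool) if k not in (i, j)]
        pool.append(ci | cj)
    return order, pool[0]


ORDER, WHOLE = mergence_order([{1, 2, 3}, {2, 3, 4}, {7, 8}])
```

In this example the two overlapping communities are merged first (Jaccard coefficient 0.5), and the disjoint community {7, 8} joins last.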

In each iteration, assuming there are k communities left in the
Community Pool, it would take $\binom{k}{2}$ calculations to select
the maximum value of the Jaccard similarity coefficient. If there are
n communities initially, the total number of calculations sums up to:

$$\sum_{k=2}^{n} \binom{k}{2} = \sum_{k=2}^{n} \frac{k(k-1)}{2} = \frac{(n+1)n(n-1)}{6}$$

For the sake of computational efficiency, the values of the Jaccard
similarity coefficient can be calculated in advance. For each
iteration, assuming there are k communities left in the Community
Pool, removing old values takes only $2(k-1)+1$ calculations and
adding new values takes $k-2$ calculations. After this optimization,
the overall number of calculations shrinks to:

$$\binom{n}{2} + \sum_{k=2}^{n} \left[ 2(k-1) + 1 + (k-2) \right] = 2n(n-1)$$
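Both closed forms can be checked numerically by evaluating the sums directly. The short sketch below does this for a range of n; the function names are illustrative.

```python
from math import comb


def naive_cost(n: int) -> int:
    """Sum of C(k, 2) for k = 2..n: Jaccard values recomputed in every iteration."""
    return sum(comb(k, 2) for k in range(2, n + 1))


def cached_cost(n: int) -> int:
    """C(n, 2) initial values plus per-iteration removals and additions."""
    return comb(n, 2) + sum(2 * (k - 1) + 1 + (k - 2) for k in range(2, n + 1))


if __name__ == "__main__":
    for n in range(2, 200):
        assert naive_cost(n) == (n + 1) * n * (n - 1) // 6
        assert cached_cost(n) == 2 * n * (n - 1)
    print(naive_cost(100), cached_cost(100))
```

The comparison makes the gain explicit: caching reduces the cubic growth of the naive scheme to quadratic growth.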

V. Experiments 

We evaluate LSBN on five well-known benchmark datasets. We expect the
structures learned by LSBN to be close to those learned by other
Bayesian structure learning algorithms. Closeness in results would
indicate that partition and local learning in LSBN cause hardly any
loss. In addition, since LSBN is designed to work on the Bayesian
structure learning problem in large-scale networks, we expect LSBN to
learn structures whose sizes exceed the computational upper bound of
the traditional Bayesian model averaging approach.

A. Datasets 

• alarm [8]. The alarm network consists of 37 random variables and 46
arcs, with an average degree of 2.49, a maximum in-degree of 4 and an
average Markov Blanket size of 3.51.

• insurance [9]. The insurance network contains 27 random variables
and 52 arcs, with an average degree of 3.85, a maximum in-degree of 3
and an average Markov Blanket size of 5.19.

• win95pts 0. The win95pts network includes 76 
random variables and 112 arcs, with average degree of
network being 2.95, maximum in-degree being 7 and
average Markov Blanket size being 5.92. 

• pigs E). The pigs network includes 441 random 
variables and 592 arcs, with average degree of network 
being 2.68, maximum in-degree being 2 and average 
Markov Blanket size being 3.66. 

• link ETI . The link network embodies 724 random 
variables and 1125 arcs, with average degree of network 
being 3.11, maximum in-degree being 3 and average 
Markov Blanket size being 4.8. 

From each benchmark network, we sampled 20000 instances
as the observed data. The pre-defined weight
function set includes: (1) mutual information, abbreviated
as 'MI'; (2) mutual information normalized by the sum of
entropies, abbreviated as 'MI_plus'; (3) mutual information
normalized by the square root of the product of entropies,
abbreviated as 'MI_sqrt'; (4) mutual information normalized by the
PageRank values, abbreviated as 'MI_pr'; (5) mutual information
after standard normalization, abbreviated as 'MI_sn';
(6) the Pearson coefficient in absolute value, abbreviated as
'Pearson'; (7) the absolute Pearson coefficient after standard
normalization, abbreviated as 'Pearson_sn'.
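
To make the first three weight functions concrete, the following Python sketch computes them for a pair of discrete variables from empirical frequencies. It is one plausible reading of the definitions above, not the authors' code: the natural logarithm, the zero-entropy fallbacks and the function names are assumptions, and the PageRank, standard-normalization and Pearson variants are not covered.

```python
import math
from collections import Counter


def entropy(xs):
    """Empirical Shannon entropy (natural log) of a sequence of discrete values."""
    n = len(xs)
    return -sum(c / n * math.log(c / n) for c in Counter(xs).values())


def mutual_information(xs, ys):
    """Empirical mutual information I(X;Y) = H(X) + H(Y) - H(X,Y)."""
    return entropy(xs) + entropy(ys) - entropy(list(zip(xs, ys)))


def edge_weights(xs, ys):
    """MI and two normalized variants for one candidate edge (X, Y)."""
    mi = mutual_information(xs, ys)
    hx, hy = entropy(xs), entropy(ys)
    return {
        "MI": mi,
        # Assumption: weight 0 when the normalizer vanishes (constant variables).
        "MI_plus": mi / (hx + hy) if hx + hy > 0 else 0.0,
        "MI_sqrt": mi / math.sqrt(hx * hy) if hx * hy > 0 else 0.0,
    }


# Two identical binary variables: maximal dependence.
same = edge_weights([0, 1, 0, 1], [0, 1, 0, 1])
# Two independent binary variables: no dependence.
indep = edge_weights([0, 0, 1, 1], [0, 1, 0, 1])
```

For identical variables the sum-normalized weight is one half and the square-root-normalized weight is one, while all three weights vanish for independent variables.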

B. Parameter Setting 

For each weighted network generated by a given weight
function, the truncate threshold $T_{trunc}$ for pruning is chosen
by the Elbow Method [34]. Specifically, we first bin all
edge weights into a histogram, in which each bin
denotes the frequency of edge weights falling in a certain
range. Then we look at the variance descent between each
pair of adjacent bins. For example, the first bin adds
much information (it encompasses a lot of variance), because the
majority of the edges have very small weights.
Yet at a certain bin, the variance ratio slows down, and we
choose that bin as the truncate threshold $T_{trunc}$; this is also called
the 'elbow criterion'. The truncate thresholds we chose are
shown in Table I; the percentage denotes the ratio between
remaining edges and all edges in the complete graph. From
the results in Table I, there is no significant difference in
the numbers of remaining edges.
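
A rough Python sketch of such an elbow rule is given below. The text does not fix the exact criterion, so the sketch makes its own assumptions: the "descent" is taken to be the drop in frequency between adjacent bins, and the elbow is the first bin after the steepest drop where the drop falls below a fraction `tol` of that steepest drop. The function name, `bins` and `tol` are hypothetical.

```python
def elbow_threshold(weights, bins=20, tol=0.05):
    """Left edge of the first histogram bin where the frequency stops dropping steeply."""
    lo, hi = min(weights), max(weights)
    width = (hi - lo) / bins
    if bins < 2 or width == 0:
        return lo  # degenerate histogram: nothing to prune
    counts = [0] * bins
    for w in weights:
        # The maximum weight is clamped into the last bin.
        counts[min(int((w - lo) / width), bins - 1)] += 1
    # Drop in frequency between each pair of adjacent bins.
    drops = [counts[i] - counts[i + 1] for i in range(bins - 1)]
    steepest = max(drops)
    if steepest <= 0:
        return lo  # frequencies never decrease
    for i in range(drops.index(steepest) + 1, bins - 1):
        if drops[i] < tol * steepest:
            return lo + i * width
    return lo + (bins - 1) * width


# Hypothetical edge weights: most edges are very weak, a few are strong.
weights = [0.05] * 100 + [0.15] * 20 + [0.25] * 5 + [0.35] * 4 + [0.45] * 3
t_trunc = elbow_threshold(weights, bins=5)
kept = [w for w in weights if w >= t_trunc]
```

With these toy weights the bin counts are 100, 20, 5, 4 and 3, so the descent flattens at the third bin and only the twelve strongest edges are kept.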

We measure the average shortest path (Table II) and diameter
(Table III) for each partition, comparing the weighted
network partitions with the second-order partition. There are
several noticeable phenomena in these results. First, there
is no dominating weight function: the partition of
a given weighted network performs excellently on some datasets
but poorly on others. For example, mutual information proves
to be the best weight function on alarm according to average
shortest path, yet it is among the worst on win95pts. Second,
as expected, the second-order partition achieves more stable
results. Its results always rank among the best and never
oscillate drastically. The average ranking of the second-order
partition is 2.4 in average shortest path (Table II) and
2 in diameter (Table III).

The partition results are given in Table IV. The
partition size distribution reveals that the majority
of communities contain at most 25 variables (100% in alarm,
100% in insurance, 100% in win95pts, 92.827% in pigs and
95.181% in link). The average community size is 10.286 in
alarm, 9.8 in insurance, 8.345 in win95pts, 9.527 in pigs and
8.904 in link.

C. Experimental Design 

Our evaluation benchmarks LSBN against four state-of-the-art
large-scale network structure learning algorithms, namely
ARACNE [26], PC [22], Greedy Search [10] and Max-Min
Hill Climbing (MMHC) [36]. Among these common algorithms,
ARACNE [26] is a very popular information-theoretic
algorithm with extreme simplicity and low computational
cost; the PC algorithm [22] is considered the
most popular constraint-based algorithm; Greedy Search [10]
is a very widely used score-and-search approach; and
MMHC [36] is a hybrid method that has proved
superior to other algorithms in most cases.

We compare LSBN to these four state-of-the-art algorithms
with respect to the correctness of the learned structures. For
the implementation of PC, Greedy Search and MMHC, we
used the Causal Explorer Toolkit [6]
(http://www.dsl-lab.org/causal_explorer/); the structural results,
in the form of directed edges, are evaluated by ourselves. The parameters
used in the Causal Explorer Toolkit for each algorithm are
left at their defaults: for example, the threshold on the statistical test is
5% for MMHC and Greedy Search; the threshold on the
mutual information test is 1% for PC; and the prior type is
the BDeu score [11] with a Dirichlet weight of
10 for Greedy Search and MMHC. As for ARACNE, we
implemented the algorithm ourselves due to its simplicity, and
the low-value thresholds $\tau$ were selected manually based on the
tradeoff between true positives and false positives, as shown in
Table V.

Table V

Low-value threshold $\tau$ in the ARACNE implementation for all datasets

| Dataset name | Low-value threshold $\tau$ |
|---|---|
| alarm | 0.01 |
| insurance | 0.05 |
| win95pts | 0.025 |
| pigs | 0.01 |
| link | 0.005 |


It is inappropriate to compare LSBN with
other algorithms directly, because the result of LSBN is
determined by two factors: (1) the performance of Bayesian
model averaging on the five benchmark datasets; (2) the
ability of LSBN to divide and conquer Bayesian model
averaging. Due to the intractability of Bayesian model
averaging, we evaluate the LSBN framework
per se. If the performance of LSBN proves to be
satisfactory, we conclude that LSBN can be
applied to Bayesian model averaging as well.

To evaluate the LSBN framework, we slightly
modify LSBN to make a fairer comparison with
ARACNE, PC, Greedy Search and MMHC. Specifically,
we replace the Bayesian model averaging process (Step 3 in
Algorithm 2) with the corresponding target structure learning
algorithm. For example, given the MMHC algorithm as the
comparison target, we use a modified version of LSBN
whose structure learning algorithm in LocalLearn is also
MMHC, while keeping everything else unchanged. As a result,
the performance of the LSBN framework can be measured
independently, regardless of the influence of
different structure learning algorithms.

D. Performance Evaluation 

As for the evaluation of structure learning, we regard 
the Bayesian structure learning problem as a binary clas- 
sification problem. For each pair of nodes, the Bayesian 
structure learning algorithm either assigns a positive label or 
a negative label to declare whether there is an edge existing 
between them or not. 

We use precision, recall and F-score as our metrics to
evaluate the performance of the LSBN system. These metrics
are defined as follows:

$$\mathrm{Precision} = \frac{TP}{TP + FP}$$

$$\mathrm{Recall} = \frac{TP}{TP + FN}$$

$$\text{F-Score} = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}}$$

where TP (True Positive) is the number of positive edges
correctly classified as positive, corresponding to the hit
edges; FP (False Positive) is the number of negative
edges mistakenly classified as positive, corresponding to the
additional edges (error edges); TN (True Negative) is the
number of negative edges correctly classified as negative;
and FN (False Negative) is the number of positive edges
mistakenly classified as negative, corresponding to the missing
edges.
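
The metrics can be computed with a few lines of Python. This is a minimal sketch: the helper names are illustrative, and comparing edges as exact tuples (so that direction matters) is an assumption, since the text does not say how edge direction is scored.

```python
def confusion(true_edges, learned_edges):
    """Count hit (TP), error (FP) and missing (FN) edges between two edge collections."""
    true_edges, learned_edges = set(true_edges), set(learned_edges)
    tp = len(true_edges & learned_edges)
    fp = len(learned_edges - true_edges)
    fn = len(true_edges - learned_edges)
    return tp, fp, fn


def precision_recall_f(tp, fp, fn):
    """Precision, recall and F-score as fractions in [0, 1]."""
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f_score = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f_score


# Toy example with hypothetical edges.
tp, fp, fn = confusion({("A", "B"), ("B", "C"), ("C", "D")},
                       {("A", "B"), ("B", "C"), ("A", "D")})

# Counts reported for ARACNE within LSBN on the Alarm dataset (Table VI).
p, r, f = precision_recall_f(tp=39, fp=6, fn=7)
```

Expressed as percentages, the second call reproduces the Precision, Recall and F-Score entries reported for that row of Table VI.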

The comparison results between LSBN and other state-of-the-art
algorithms such as ARACNE, PC, Greedy Search
and MMHC are given in Table VI for the Alarm dataset,
Table VII for the Insurance dataset, Table VIII for the Win95pts
dataset, Table IX for the Pigs dataset and Table X for the Link
dataset. The hit edge number, additional edge number
and missing edge number, as well as the corresponding metrics
(Precision, Recall and F-Score) of the networks
reconstructed by LSBN, are shown against their counterparts
generated by the algorithms run in global space. Note
that Bayesian model averaging, abbreviated as 'Model Avg',
is intractable in global scope, so the relevant blanks are
filled with 'NA'.

Table I

Truncate threshold for each weight function on each dataset

| Weight function | insurance $T_{trunc}$ | insurance percent | alarm $T_{trunc}$ | alarm percent | win95pts $T_{trunc}$ | win95pts percent | pigs $T_{trunc}$ | pigs percent | link $T_{trunc}$ | link percent |
|---|---|---|---|---|---|---|---|---|---|---|
| MI | 0.08 | 0.2308 | 0.01 | 0.2267 | 0.007 | 0.1659 | 0.005 | 0.0841 | 0.035 | 0.0566 |
| MI_plus | 0.2 | 0.2792 | 0.03 | 0.2312 | 0.05 | 0.1635 | 0.05 | 0.0958 | 0.4 | 0.0435 |
| MI_sqrt | 0.5 | 0.3646 | 0.5 | 0.2132 | 0.4 | 0.1618 | 2 | 0.0726 | 2.5 | 0.0493 |
| MI_pr | 0.05 | 0.3276 | 0.02 | 0.2267 | 0.013 | 0.1687 | 0.005 | 0.0762 | 0.5 | 0.0510 |
| MI_sn |  | 0.2707 |  | 0.2072 | 0.05 | 0.1368 | 0.8 | 0.0861 | 0.6 | 0.0441 |
| Pearson | 0.1 | 0.4017 | 0.4 | 0.2747 | 0.02 | 0.1894 | 0.026 | 0.1516 | 0.075 | 0.0585 |
| Pearson_sn |  | 0.3789 |  | 0.2087 | 0.75 | 0.1175 |  | 0.0731 | 1.0 | 0.0490 |

Table II

Comparison of average shortest path for various weight functions

| Weight function | insurance | alarm | win95pts | pigs | link | ranking |
|---|---|---|---|---|---|---|
| MI | 1.5333(3) | 3.8162(1) | 2.9799(6) | 5.2616(4) | 2.6427(4) | 3.6 |
| MI_plus | 1.55(4) | 4.1567(8) | 3.1130(8) | 5.4789(5) | 2.4552(1) | 5.2 |
| MI_sqrt | 1.9456(8) | 3.8792(3) | 3.0156(7) | 5.6248(7) | 2.4622(2) | 5.4 |
| MI_pr | 1.7983(6) | 3.8567(2) | 2.5521(1) | 5.1297(3) | 3.7409(6) | 3.6 |
| MI_sn | 1.4927(2) | 3.9682(7) | 2.8828(5) | 5.5749(6) | 4.4982(8) | 5.6 |
| Pearson | 1.7357(5) | 3.8991(5) | 2.7558(2) | 4.4463(2) | 2.6801(5) | 3.8 |
| Pearson_sn | 1.8430(7) | 3.9178(6) | 2.8380(4) | 5.6633(8) | 4.0882(7) | 6.4 |
| second-order | 1.36(1) | 3.8875(4) | 2.8240(3) | 4.28198(1) | 2.5624(3) | 2.4 |

Note: The number within the parentheses denotes the ascending ranking of the current weight function on the given dataset.

Table III

Comparison of average diameters for various weight functions

| Weight function | insurance | alarm | win95pts | pigs | link | ranking |
|---|---|---|---|---|---|---|
| MI | 3.0(3) | 6.1667(2) | 5.1429(6) | 10.6875(4) | 5.0026(8) | 4.6 |
| MI_plus | 3.0(3) | 6.8(7) | 5.4(8) | 11.2247(5) | 4.3666(4) | 5.4 |
| MI_sqrt | 4.0(8) | 6.8333(8) | 5.1667(7) | 11.672(8) | 4.3480(3) | 6.8 |
| MI_pr | 3.6(6) | 6.2(3) | 4.05(1) | 9.4707(3) | 4.6027(6) | 3.8 |
| MI_sn | 2.7143(2) | 6.7143(6) | 4.6857(4) | 11.4805(7) | 4.4982(5) | 4.8 |
| Pearson | 3.33(5) | 6.1(1) | 4.5556(2) | 8.3585(2) | 4.9513(7) | 3.4 |
| Pearson_sn | 3.778(7) | 6.625(5) | 4.886(5) | 11.3907(6) | 4.0882(2) | 5 |
| second-order | 2.423(1) | 6.4211(4) | 4.619(3) | 7.7565(1) | 3.7409(1) | 2 |

Note: The number within the parentheses denotes the ascending ranking of the current weight function on the given dataset.

Table IV

Partition size distribution for all datasets

| Dataset | 1-5 | 6-10 | 11-15 | 16-20 | 21-25 | 26-30 | 31-35 | 36-40 | 41-45 | 46-50 | >50 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| alarm | 3 | 2 | 1 | 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| insurance | 1 | 3 | 6 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| win95pts | 19 | 19 | 13 | 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| pigs | 102 | 76 | 24 | 11 | 7 | 6 | 6 | 1 | 1 | 1 | 2 |
| link | 81 | 47 | 16 | 5 | 9 | 3 | 0 | 2 | 3 | 0 | 0 |


The precision of LSBN is slightly inferior to that of the global
run for benchmark algorithms such as ARACNE, PC and MMHC,
but superior for Greedy Search (Table XI). The recall of
LSBN is superior to that of the global run for ARACNE and PC,
but slightly inferior for Greedy Search and MMHC (Table
XII). The F-Score, which is the harmonic mean of precision
and recall, is comparable to that of the global run for
ARACNE, PC, Greedy Search and MMHC (Table XIII). The
results of precision, recall and F-Score reveal that LSBN per
se does not introduce noticeable errors in the procedure of
partition, sampling, intra-community structure learning and
mergence. What's more, in some cases, LSBN even improves
the learning quality.
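
The normalized entries in Tables XI to XIII appear to be the LSBN metric divided by the corresponding global metric, expressed as a percentage. The Python sketch below checks this reading on the ARACNE row of the Alarm dataset, using the rounded values from Table VI; the function name is illustrative.

```python
def relative_to_global(lsbn_value, global_value):
    """LSBN metric as a percentage of the same metric obtained in global space."""
    return 100.0 * lsbn_value / global_value


# ARACNE on Alarm (Table VI): (LSBN value, global value) for each metric.
aracne_alarm = {
    "precision": (86.667, 88.571),
    "recall": (84.783, 67.391),
    "f_score": (85.714, 76.543),
}
ratios = {name: relative_to_global(lsbn, glob)
          for name, (lsbn, glob) in aracne_alarm.items()}
```

Values above 100% mean that the LSBN variant outperforms the same algorithm run in global space on that metric.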



As for our target, Bayesian model averaging, there is no
global result available for comparison. Judged against the other
algorithms, it performs well on datasets such as Alarm, Insurance,
Win95pts and Link. Despite the significant disparity on Pigs,
the results learned by Bayesian model averaging are
close to those learned by other state-of-the-art algorithms
in most cases.



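Table VI

Evaluation of LSBN against other algorithms on Alarm Dataset

| Algorithm | LSBN Hit(TP) | LSBN Miss(FN) | LSBN Error(FP) | LSBN Precision | LSBN Recall | LSBN F-Score | Global Hit(TP) | Global Miss(FN) | Global Error(FP) | Global Precision | Global Recall | Global F-Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ARACNE | 39 | 7 | 6 | 86.667 | 84.783 | 85.714 | 31 | 15 | 4 | 88.571 | 67.391 | 76.543 |
| PC | 38 | 8 | 0 | 100 | 82.609 | 90.476 | 35 | 11 | 0 | 100 | 76.087 | 86.420 |
| Greedy | 43 | 3 | 4 | 82.979 | 84.783 | 83.871 | 44 | 2 | 9 | 83.019 | 95.652 | 88.889 |
| MMHC | 43 | 3 | 3 | 93.478 | 93.478 | 93.478 | 44 | 2 | 1 | 97.778 | 95.652 | 96.703 |
| Model Avg | 43 | 3 | 8 | 84.314 | 93.478 | 88.660 | NA | NA | NA | NA | NA | NA |
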


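Table VII

Evaluation of LSBN against other algorithms on Insurance Dataset

| Algorithm | LSBN Hit(TP) | LSBN Miss(FN) | LSBN Error(FP) | LSBN Precision | LSBN Recall | LSBN F-Score | Global Hit(TP) | Global Miss(FN) | Global Error(FP) | Global Precision | Global Recall | Global F-Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ARACNE | 33 | 19 | 4 | 89.189 | 63.462 | 74.157 | 25 | 27 | 2 | 92.593 | 48.077 | 63.291 |
| PC | 36 | 16 | 3 | 92.308 | 69.231 | 79.121 | 31 | 21 | 1 | 96.875 | 59.615 | 73.810 |
| Greedy | 41 | 11 | 7 | 87.179 | 65.385 | 82.000 | 47 | 5 | 11 | 81.034 | 90.385 | 85.455 |
| MMHC | 43 | 9 | 4 | 91.489 | 82.692 | 86.869 | 43 | 9 | 2 | 95.556 | 82.692 | 88.660 |
| Model Avg | 45 | 7 | 7 | 86.538 | 86.538 | 86.538 | NA | NA | NA | NA | NA | NA |
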


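Table VIII

Evaluation of LSBN against other algorithms on Win95pts Dataset

| Algorithm | LSBN Hit(TP) | LSBN Miss(FN) | LSBN Error(FP) | LSBN Precision | LSBN Recall | LSBN F-Score | Global Hit(TP) | Global Miss(FN) | Global Error(FP) | Global Precision | Global Recall | Global F-Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ARACNE | 81 | 31 | 39 | 67.500 | 72.321 | 69.828 | 53 | 59 | 8 | 86.885 | 47.321 | 61.272 |
| PC | 64 | 48 | 8 | 88.889 | 57.143 | 69.565 | 38 | 74 | 3 | 92.683 | 33.929 | 49.673 |
| Greedy | 99 | 13 | 143 | 40.909 | 88.393 | 55.932 | 94 | 18 | 106 | 47.000 | 83.929 | 60.256 |
| MMHC | 92 | 20 | 56 | 62.162 | 82.143 | 70.769 | 90 | 22 | 32 | 73.770 | 80.357 | 76.923 |
| Model Avg | 98 | 14 | 93 | 51.309 | 87.500 | 64.686 | NA | NA | NA | NA | NA | NA |
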


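Table IX

Evaluation of LSBN against other algorithms on Pigs Dataset

| Algorithm | LSBN Hit(TP) | LSBN Miss(FN) | LSBN Error(FP) | LSBN Precision | LSBN Recall | LSBN F-Score | Global Hit(TP) | Global Miss(FN) | Global Error(FP) | Global Precision | Global Recall | Global F-Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ARACNE | 592 | 0 | 14 | 97.690 | 100.00 | 98.831 | 592 | 15 | 4 | 99.831 | 100.00 | 99.916 |
| PC | 574 | 18 | 0 | 100.00 | 96.959 | 98.456 | 591 | 1 | 8 | 98.664 | 99.831 | 99.244 |
| Greedy | 570 | 22 | 13 | 97.770 | 96.284 | 97.021 | 592 | 0 | 47 | 92.645 | 100.00 | 96.182 |
| MMHC | 574 | 18 | 2 | 99.653 | 96.959 | 98.288 | 592 | 0 | 0 | 100.00 | 100.00 | 100.00 |
| Model Avg | 447 | 145 | 940 | 32.228 | 75.507 | 45.174 | NA | NA | NA | NA | NA | NA |
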


Table X

Evaluation of LSBN against other algorithms on Link Dataset

| Algorithm | LSBN Hit(TP) | LSBN Miss(FN) | LSBN Error(FP) | LSBN Precision | LSBN Recall | LSBN F-Score | Global Hit(TP) | Global Miss(FN) | Global Error(FP) | Global Precision | Global Recall | Global F-Score |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ARACNE | 422 | 703 | 338 | 55.526 | 37.511 | 44.775 | 444 | 681 | 280 | 61.326 | 39.467 | 48.026 |
| PC | 466 | 659 | 277 | 62.719 | 41.422 | 49.893 | 70 | 1055 | 26 | 72.917 | 6.222 | 11.466 |
| Greedy | 413 | 712 | 342 | 54.702 | 36.711 | 43.936 | 783 | 342 | 1374 | 36.300 | 69.600 | 47.715 |
| MMHC | 474 | 651 | 321 | 59.623 | 42.133 | 49.375 | 621 | 504 | 418 | 59.769 | 55.200 | 57.394 |
| Model Avg | 408 | 717 | 413 | 49.695 | 36.267 | 41.932 | NA | NA | NA | NA | NA | NA |

Table XI

Significance of LSBN's Precision normalized by the Precision of
global situation for four state-of-art structure learning
algorithms.

| Dataset | ARACNE | PC | Greedy | MMHC |
|---|---|---|---|---|
| alarm | 97.850% | 100% | 99.952% | 95.602% |
| insurance | 96.324% | 95.286% | 107.583% | 95.744% |
| win95pts | 77.689% | 95.906% | 87.040% | 84.265% |
| pigs | 97.855% | 101.354% | 105.532% | 99.653% |
| link | 90.542% | 86.014% | 150.694% | 99.756% |

Table XII

Significance of LSBN's Recall normalized by the Recall of
global situation for four state-of-art structure learning
algorithms.

| Dataset | ARACNE | PC | Greedy | MMHC |
|---|---|---|---|---|
| alarm | 125.808% | 108.572% | 88.637% | 97.727% |
| insurance | 132.001% | 116.130% | 72.341% | 100% |
| win95pts | 152.831% | 168.419% | 105.319% | 102.223% |
| pigs | 100% | 97.123% | 96.284% | 96.959% |
| link | 95.044% | 665.734% | 52.746% | 76.328% |

Table XIII

Significance of LSBN's F-Score normalized by the F-Score of
global situation for four state-of-art structure learning
algorithms.

| Dataset | ARACNE | PC | Greedy | MMHC |
|---|---|---|---|---|
| alarm | 111.982% | 104.693% | 94.355% | 96.665% |
| insurance | 117.168% | 107.196% | 95.957% | 97.980% |
| win95pts | 113.964% | 140.046% | 92.824% | 92.000% |
| pigs | 98.914% | 99.206% | 100.872% | 98.288% |
| link | 93.231% | 435.139% | 92.080% | 86.028% |

VI. Conclusion

In this paper we present a novel framework for Bayesian
structure learning using Model Averaging in large-scale
networks, called LSBN (Large-Scale Bayesian Network). In
general, the framework follows the principle of divide-and-conquer
by partitioning variables into multiple overlapping
communities, learning intra-community structures individ- 
ually and merging them together. Specifically, LSBN first 
performs the partition by using a second-order partition 
strategy, called ROPART, which is verified to achieve more 
robust results. Then LSBN proposes a learning algorithm, 
named LocalLearn, to conduct sampling and structure learn- 
ing within each overlapping community after the community 
is isolated from other variables by Markov Blanket. Finally 
LSBN employs an efficient algorithm, called MERGENCE, 
to merge structures of overlapping communities into a 
whole. 

In comparison with four other state-of-the-art large-scale
network structure learning algorithms, namely ARACNE, PC,
Greedy Search and MMHC, LSBN shows comparable results
on five common benchmark datasets, evaluated by precision,
recall and F-score. What's more, LSBN makes it possible
to learn large-scale Bayesian structures by Model Averaging,
which used to be intractable.

In summary, LSBN provides a scalable and parallel
framework for the reconstruction of network structures.
Besides, the complete information of overlapping communities
serves as a byproduct, which could be used to mine
meaningful clusters in biological networks, such as
protein-protein interaction networks or gene regulatory networks, as
well as in social networks.
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